Abstract: We present a generic method for describing hardware
that includes a memory access controller in a C-based high-level
synthesis technology (Handel-C). With this method, a prefetching
mechanism that improves performance by hiding memory access
latency can be described systematically in C. A case study
demonstrates that the proposed method is simple and easy to
apply. The experimental results show that, although the proposed
method increases logic resources slightly (by a factor of 1.09 in
our implementation), it achieves speedups of 1.21 to 1.79.

Index Terms: high-level synthesis, memory access, data
prefetch, FPGA, Handel-C, latency hiding

I. INTRODUCTION

System-on-chip (SoC) devices used in embedded products must be
low in cost and must keep pace with short product life cycles.
To reduce the development time and cost of an SoC, high-level
synthesis (HLS) technologies have been proposed and developed.
These technologies convert an abstract, algorithm-level
description written in a language such as C, C++, Java, or MATLAB
into a register transfer-level (RTL) description in a hardware
description language (HDL). In addition, many researchers in this
domain have noted that memory access latency must be hidden in
order to extract the inherent performance of the hardware.

Some researchers and developers have proposed hardware platforms
that combine HLS technology with an efficient memory access
controller (MAC). MACs that form part of such platforms and are
described at a high level in C, C++, or MATLAB have been
presented [1, 2, 3]. On these platforms, the designer only has to
write the data processing hardware, which accesses the memory
through simple interfaces to the MAC. However, these platforms do
not consider hiding memory access latency.

Ref. [4] describes memory access schedulers, built as hardware,
that reduce memory access latency by issuing memory commands
efficiently and hiding the DRAM refresh cycle. That proposal
hides only the latencies native to the DRAM, such as the RAS-CAS
latency, the bank switching latency, and the refresh cycle. Such
a scheduler therefore cannot hide application-specific latency,
such as that of streaming data buffering, block data buffering,
and window buffering. Some HLS tools [5, 6, 7, 8] can describe
memory access behavior in C as well as the data processing
hardware. However, Refs. [5, 6, 7] do not present a generic
method for describing how to hide memory access latency, so
designers must have deep knowledge of the HLS tool in use and
must rely on their own skill to write the MAC. Ref. [8] requires
a load/store unit described in HDL, which is a tiny processor
dedicated to memory access. As a result, high-speed simulation at
the C level cannot be performed at an early stage of development.
We propose a generic method for describing, at the C level,
hardware that includes functionality for hiding
application-specific memory access latency. This paper focuses on
Handel-C [9], an HLS tool that has been widely used for hardware
design.

To hide memory access latency, our method employs software
pipelining [10], which is widely used in high-performance
computing. Software pipelining restructures the program by
copying the load operation in the loop to the front of the loop
and the store operation to the back of the loop. Because this
transformation is very simple, the user can easily apply it and
describe hardware that hides memory access latency. Consequently,
a generic method for C-level memory latency hiding can be
introduced into conventional HLS technology.

The effect of software pipelining is generally assessed by
performance estimation. The conventional method [10] uses the
average memory latency of a processor with a write-back cache
memory. For a hardware module in an embedded SoC, such a cache
memory is very expensive and cannot be employed, so a new
performance estimation method is needed. We therefore propose a
new estimation method that takes into account the hardware module
to be mounted on the SoC.

The rest of the paper is organized as follows. Section II shows
the target hardware architecture. Section III describes the
load/store functions in Handel-C. Section IV describes the
templates for memory access and data processing. Section V
demonstrates the software pipelining method for hiding memory
access latency. Section VI explains the new performance
estimation method, which is based on the hardware architecture
shown in Section II and considers the loads and stores of software
pipelining. Section VII shows the experimental results. Finally,
Section VIII concludes the paper and outlines future work.

Figure. 1 Target architecture.




ACEEE Int. J. on Information Technology, Vol. 02, No. 02, April 2012 



II. TARGET HARDWARE ARCHITECTURE

Fig. 1 shows the architecture of the target hardware. This
architecture is familiar to hardware designers and well suited to
HLS technologies. It consists of the memory access part (MAP),
the input and output FIFOs (IF and OF), and the data processing
part (DP), all of which are written in Handel-C. The MAP is
control-intensive hardware that loads data from and stores data
to the memory. The MAP accesses the memory via the Wishbone bus
[11] in a conventional request-ready fashion. The MAP also
contains a register file, called the mailbox (MB), that holds the
control/status registers of the hardware module. The designer of
the hardware module can define each register in the MB
arbitrarily, and each register can be accessed by external
devices such as the embedded processor. The DP is data processing
hardware that processes streaming data, a kind of hardware that
HLS technologies generally handle well.

The MAP accesses the memory according to a memory access pattern
written in C. The MAP loads data into the IF, converting memory
data into streaming data. The DP processes the streaming data
from the IF and pushes the processed data into the OF as
streaming data. The MAP reads the streaming data from the OF and
stores it into the memory according to the memory access pattern.
Because the MAP and the DP are decoupled by the input and output
FIFOs, the hardware description keeps memory access and data
processing separate. HLS technologies generally provide FIFO
primitives.

III. LOAD/STORE FUNCTION

Memories such as SRAM, SDRAM, and DDR SDRAM generally support
burst transfers of 4 to 8 consecutive words. We therefore
describe load and store primitives that support burst transfer as
function calls, as shown in Fig. 2 and Fig. 3, respectively.

In the load function shown in Fig. 2, the bus request is issued
with WE_O set to 0 in lines 2-5. When the acknowledgement (ACK_I)
is asserted by the memory, the function performs the load burst
transfer in lines 6-20. In Handel-C, "par" executes the
statements in its block in parallel within 1 clock cycle. In
contrast, "seq" executes the statements in its block
sequentially, and each statement consumes 1 clock cycle. That is,
the consecutive words of the burst transfer are pushed into the
input FIFO (IF) one by one. The '!' operator is the Handel-C
primitive that pushes a value into a FIFO. When the specified
FIFO (the IF in this example) is full, the statement blocks until
the FIFO has free space. When the burst transfer finishes, the
current address is advanced by the size of one burst, as shown in
line 17, so that the next load burst transfer can be issued.

In the store function shown in Fig. 3, line 2 attempts to pop the
output FIFO (OF) in order to store the data processed by the data
processing part (DP). The '?' operator is the Handel-C primitive
that pops a value from a FIFO. When the OF is empty, the
statement blocks until the DP finishes the data processing and
pushes the result into the OF.

 1: void mem_load (uint WORD_WIDTH *addr){
 2:   do{ //Bus Request.
 3:     par{CTI_O=0x2; CYC_O=1; STB_O=1; WE_O=0;
 4:       SEL_O=0xf; ADR_O=*addr; }
 5:   }while(ACK_I==0);
 6:   seq(i=0;i<BURST_LEN;i++){
 7:     par{
 8:       IF ! DAT_I; //Push input data to IF.
 9:       if(i==(BURST_LEN-2))
10:         CTI_O=0x7; //Inform burst end.
11:       //Terminate burst transfer.
12:       if(i==(BURST_LEN-1))
13:         par{
14:           CTI_O=0; CYC_O=0; STB_O=0;
15:           SEL_O=0; ADR_O=0;
16:           //Update load address.
17:           *addr+=BURST_LEN*WORD_SIZE;
18:         }
19:     }
20:   }
21: }

Figure. 2 Load function.



 1: void mem_store (uint WORD_WIDTH *addr){
 2:   OF ? temp; //Pop OF to temp.
 3:   do{ //Bus Request.
 4:     par{CTI_O=0x2; CYC_O=1; STB_O=1; WE_O=1;
 5:       SEL_O=0xf; ADR_O=*addr; }
 6:   }while(ACK_I==0);
 7:   DAT_O=temp; //Output first word.
 8:   seq(i=1;i<BURST_LEN;i++){
 9:     par{
10:       OF ? DAT_O; //Pop OF to bus output.
11:       if(i==(BURST_LEN-2))
12:         CTI_O=0x7; //Inform burst end.
13:       //Terminate burst transfer.
14:       if(i==(BURST_LEN-1))
15:         par{
16:           CTI_O=0; CYC_O=0; STB_O=0; WE_O=0;
17:           SEL_O=0; ADR_O=0;
18:           //Update store address.
19:           *addr+=BURST_LEN*WORD_SIZE;
20:         }
21:     }
22:   }
23: }

Figure. 3 Store function

Then, the bus request is issued with WE_O set to 1 in lines 3-6.
When the acknowledgement (ACK_I) is asserted by the memory, the
function performs the store burst transfer in lines 7-22. During
the burst transfer, the OF is popped and the popped data are
driven onto the output port (DAT_O) one word per clock cycle. As
in the load function, the current address is advanced by the size
of one burst in line 19 for the next transfer.
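The data movement performed by the two primitives can be mimicked in software. The following Python sketch is a purely functional model under stated assumptions, not Handel-C and not cycle accurate: the memory is a Python list of words, the FIFOs are `collections.deque` objects, the Wishbone handshake is omitted, and the updated address is returned instead of being written through a pointer. The constant values and the list-based memory are illustrative choices, not taken from the paper's implementation.

```python
from collections import deque

BURST_LEN = 4   # words per burst (assumed; the experiment in this paper uses 4)
WORD_SIZE = 4   # bytes per word (assumed 32-bit words)


def mem_load(memory, addr, in_fifo):
    """Push one burst starting at byte address addr into in_fifo; return the next address."""
    base = addr // WORD_SIZE
    for i in range(BURST_LEN):
        in_fifo.append(memory[base + i])      # models "IF ! DAT_I"
    return addr + BURST_LEN * WORD_SIZE       # address update after the burst


def mem_store(memory, addr, out_fifo):
    """Pop one burst from out_fifo into memory at byte address addr; return the next address."""
    base = addr // WORD_SIZE
    for i in range(BURST_LEN):
        # Models "OF ? ..."; unlike the hardware FIFO, popleft() raises
        # IndexError on an empty deque instead of blocking.
        memory[base + i] = out_fifo.popleft()
    return addr + BURST_LEN * WORD_SIZE


memory = list(range(16))                      # 16 words of toy data
fifo = deque()                                # one FIFO reused as IF and OF
next_read = mem_load(memory, 0, fifo)         # load words 0..3
next_write = mem_store(memory, 32, fifo)      # store them at words 8..11
print(next_read, next_write, memory[8:12])
```

Both functions advance the byte address by BURST_LEN * WORD_SIZE, which corresponds to the address update at the end of each burst in Fig. 2 and Fig. 3.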

 1: void MAP (void){
 2:   while(1){
 3:     //Waiting for invocation.
 4:     while(MB[0]==0) delay;
 5:     par{
 6:       MB[0]=0; //Reset start flag.
 7:       MB[4]=0; //Reset end flag.
 8:       read_addr =MB[1]; //Get read address.
 9:       write_addr=MB[2]; //Get write address.
10:       end_addr  =MB[3]; //Get end address.
11:     }
12:     while(read_addr < end_addr){
13:       mem_load(&read_addr);   //Mem => IF.
14:       mem_store(&write_addr); //OF => Mem.
15:     }
16:     MB[4] = 1; //Set end flag.
17:   }
18: }

Figure. 4 Template of memory access part

 1: void DP(void){
 2:   uint WORD_WIDTH i_dat[BURST_LEN];
 3:   uint WORD_WIDTH o_dat[BURST_LEN];
 4:
 5:   while(1){
 6:     //Pop from IF into i_dat.
 7:     seq(i=0;i<BURST_LEN;i++){IF ? i_dat[i];}
 8:     /* Data processing using i_dat. */
 9:
10:     /* Results are stored into o_dat. */
11:     //Push the processed data into OF.
12:     seq(i=0;i<BURST_LEN;i++){OF ! o_dat[i];}
13:   }
14: }

Figure. 5 Template of data processing part

1: void main(void){
2:   par{
3:     MB();  //Behavior of MB as bus slave.
4:     MAP(); //Memory access part.
5:     DP();  //Data processing part.
6:   }
7: }

Figure. 6 Whole of hardware module



IV. MEMORY ACCESS AND DATA PROCESSING 

Using the load and store functions shown in Fig. 2 and Fig. 3,
the memory access part can be written easily, as shown in Fig. 4.
Since Fig. 4 is a template of the memory access part (MAP), the
designer can modify it as needed. The MAP cooperates with the
data processing part (DP) shown in Fig. 5. Fig. 5 is also a
template, so the designer must describe the data processing in
lines 8 to 10.

In this example, mailbox register #0 (MB[0]) is used as the start
flag of the hardware module. MB[4] is used as the end flag, which
indicates that the hardware module has completely finished the
data processing. MB[1], MB[2], and MB[3] hold the address
parameters: the read address, the write address, and the end
address, respectively. By using the mailbox (MB), different
memory access patterns can be realized flexibly.

When the MAP is invoked, it loads data from the memory by burst
transfer and pushes the loaded words into the input FIFO (IF)
through the mem_load call in Fig. 4. Until then, the DP is
blocked by the pop statement on the IF in line 7 of Fig. 5. When
the MAP pushes words into the IF, the DP pops them into the
temporary array (i_dat[i]). The DP then processes the popped data
and writes the result into the temporary array (o_dat[i]).
Meanwhile, the MAP is blocked by the pop statement at the start
of mem_store (line 2 of Fig. 3) until the DP pushes the result
data into the OF. When the DP pushes the processed data into the
OF, the MAP can perform the store burst transfer.

 1: void MAP (void){
    ...
12:     mem_load(&read_addr);      //prologue
13:     while(read_addr < end_addr){ //kernel
14:       mem_load(&read_addr);    //Mem => IF.
15:       mem_store(&write_addr);  //OF => Mem.
16:     }
17:     mem_store(&write_addr);    //epilogue
18:     MB[4] = 1; //Set end flag.
19:   }
20: }

Figure. 7 Software pipelining

Figure. 8 Execution snapshot

This flow is repeated until all data have been processed. When
the data processing finishes completely, the MAP exits the main
loop in lines 12 to 15 of Fig. 4 and sets the end flag to 1 in
line 16. The MAP then blocks again in line 4 of Fig. 4 and the DP
blocks in line 7 of Fig. 5, so the hardware module is ready to be
executed with new data.

Fig. 6 shows how the whole hardware module is described. MB() is
the function that implements the behavior of the mailbox as a bus
slave; its details are omitted because of space limitations. In
this program, the MB, the MAP, and the DP are executed in
parallel.
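The cooperation between the MAP and the DP through blocking FIFOs can be illustrated with a small software model. The Python sketch below is an assumption-laden analogy rather than the authors' implementation: threads stand in for the parallel Handel-C processes, `queue.Queue` objects with a bounded size stand in for the IF and the OF, the mailbox is a five-element list, and `run_module` and `process` are hypothetical names. Unlike the DP in Fig. 5, which loops forever, the DP thread here is bounded so that the program terminates.

```python
import queue
import threading

BURST_LEN = 4                        # words per burst (assumed)
WORD_SIZE = 4                        # bytes per word (assumed 32-bit words)
BURST_BYTES = BURST_LEN * WORD_SIZE


def run_module(memory, mb, process):
    """Model one invocation of the module; mb is the 5-entry mailbox list."""
    in_fifo = queue.Queue(maxsize=BURST_LEN)     # IF: put() blocks when full
    out_fifo = queue.Queue(maxsize=BURST_LEN)    # OF: get() blocks when empty
    read_addr, write_addr, end_addr = mb[1], mb[2], mb[3]
    mb[0] = 0                                    # reset start flag
    mb[4] = 0                                    # reset end flag
    n_bursts = -(-(end_addr - read_addr) // BURST_BYTES)   # ceiling division

    def map_part():
        r, w = read_addr, write_addr
        while r < end_addr:
            for i in range(BURST_LEN):           # mem_load: Mem => IF
                in_fifo.put(memory[r // WORD_SIZE + i])
            r += BURST_BYTES
            for i in range(BURST_LEN):           # mem_store: OF => Mem
                memory[w // WORD_SIZE + i] = out_fifo.get()
            w += BURST_BYTES
        mb[4] = 1                                # set end flag

    def dp_part():
        # The hardware DP loops forever; bounded here so the thread can exit.
        for _ in range(n_bursts):
            i_dat = [in_fifo.get() for _ in range(BURST_LEN)]
            o_dat = [process(x) for x in i_dat]
            for x in o_dat:
                out_fifo.put(x)

    workers = [threading.Thread(target=map_part),
               threading.Thread(target=dp_part)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()


memory = list(range(8)) + [0] * 8    # 8 input words followed by 8 output words
mailbox = [1, 0, 32, 32, 0]          # start flag, read, write, end address, end flag
run_module(memory, mailbox, lambda x: 2 * x)
print(memory[8:], mailbox)
```

In this model the MAP thread blocks in `out_fifo.get()` until the DP thread has produced its results, which mirrors the blocking pop at the start of mem_store.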

V. SOFTWARE PIPELINING 

To hide memory access latency, software pipelining has
conventionally been applied to the source code of the application
program [10]. Our proposal applies software pipelining to the
program of the memory access part (MAP) shown in Fig. 4. Fig. 7
shows an overview of software pipelining applied to the MAP
program.

In software pipelining, the load burst transfer (mem_load) in the
main loop is copied to the front of the loop, and the store burst
transfer (mem_store) is copied to the back of the loop. In the
main loop, the data used in the next iteration are loaded during
the current iteration. Thus, the memory accesses of the MAP
overlap with the processing of the data processing part (DP). In
addition, the sizes of the input and output FIFOs are doubled so
that a burst loaded ahead of time can be buffered while the DP is
still working on the previous one.
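The reordering can be made concrete by listing the order in which the MAP issues its bursts. The Python sketch below is illustrative only: `normal_schedule` and `pipelined_schedule` are hypothetical helper names, and each tuple records an operation and the index of the burst it handles, assuming n bursts in total.

```python
def normal_schedule(n):
    """Order of MAP operations for n bursts without software pipelining (Fig. 4)."""
    ops = []
    for k in range(1, n + 1):
        ops.append(("load", k))
        ops.append(("store", k))
    return ops


def pipelined_schedule(n):
    """Order of MAP operations for n bursts with software pipelining (Fig. 7)."""
    ops = [("load", 1)]                  # prologue
    for k in range(1, n):                # kernel: n - 1 iterations
        ops.append(("load", k + 1))      # load the data for the next iteration
        ops.append(("store", k))         # store the result of the current one
    ops.append(("store", n))             # epilogue
    return ops


print(pipelined_schedule(3))
```

Both schedules contain the same n loads and n stores. In the pipelined schedule, the load of burst k + 1 is issued before the store of burst k, which is what allows the load to overlap with the processing of burst k.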

VI. PERFORMANCE ESTIMATION

In conventional software pipelining, the effect is estimated by
using the average memory latency [10]. However, the average
memory latency is measured on a processor with a cache memory, so
it does not apply to hardware modules used in SoCs, which
generally have no cache memory. The memory access latency of
every load and store should therefore be considered. When Tdp is
the data processing time of the data processing part (DP), the
normal execution time (Tne) without software pipelining is given
by the following expression.

$$T_{ne} = (T_l + T_{dp} + T_s) \times n \tag{1}$$

where Tl is the load latency, Ts is the store latency, and n is
the number of iterations. Fig. 8 shows an execution snapshot with
software pipelining applied. NE denotes normal execution and SP
denotes software-pipelined execution. The number in each square
indicates the iteration number.

As shown in the upper part of Fig. 8, if Tdp is greater than or
equal to Tl + Ts, the memory access latency is fully hidden, and
the data processing time dominates the total execution time. In
contrast, as shown in the lower part of Fig. 8, if Tdp is less
than Tl + Ts, the memory latency affects the performance, and the
total execution time approaches the memory bottleneck. The
software-pipelined execution time (Tsp) is given as follows.



$$T_{sp} = \begin{cases} T_{dp} \times n + T_l + T_s, & T_{dp} \ge T_l + T_s, \\ T_{dp} + (T_l + T_s) \times n, & T_{dp} < T_l + T_s. \end{cases} \tag{2}$$



When the speedup ratio (Tne / Tsp) is greater than 1, software
pipelining is effective for the hardware module.
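Expressions (1) and (2) are simple enough to evaluate directly. The following Python sketch transcribes them as written; the function names `t_ne`, `t_sp`, and `speedup` are illustrative, all times are in clock cycles, and the example values Tl = 8, Ts = 9, and n = 16384 are the ones used in the experiment of this paper. The model does not include the FIFO access overhead discussed with the experimental results.

```python
def t_ne(t_l, t_s, t_dp, n):
    """Normal execution time, expression (1)."""
    return (t_l + t_dp + t_s) * n


def t_sp(t_l, t_s, t_dp, n):
    """Software-pipelined execution time, expression (2)."""
    if t_dp >= t_l + t_s:
        return t_dp * n + t_l + t_s       # data processing dominates
    return t_dp + (t_l + t_s) * n         # memory access dominates


def speedup(t_l, t_s, t_dp, n):
    """Estimated speedup ratio Tne / Tsp."""
    return t_ne(t_l, t_s, t_dp, n) / t_sp(t_l, t_s, t_dp, n)


T_L, T_S, N = 8, 9, 16384                 # parameters from the experiment
for t_dp in (0, 8, 17, 34):               # hypothetical data processing times
    print(t_dp, t_ne(T_L, T_S, t_dp, N), t_sp(T_L, T_S, t_dp, N),
          round(speedup(T_L, T_S, t_dp, N), 3))
```

Under this model the ratio is largest at Tdp = Tl + Ts, where Tne is 2(Tl + Ts)n and Tsp is (Tl + Ts)(n + 1), so the estimated speedup approaches 2 as n grows.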

VII. EXPERIMENT AND DISCUSSION

A. Performance Evaluation 

To confirm the effect of software pipelining on performance, we
described the whole hardware shown in Fig. 4, Fig. 5, Fig. 6, and
Fig. 7 in Handel-C (DK Design tool 5.4 of Mentor Graphics). For
the data processing part in Fig. 5, we inserted a delay loop into
lines 8 to 10 as the data processing. Varying the number of clock
cycles of the delay loop, we measured the execution time with a
logic simulator (ModelSim 10.1 of Mentor Graphics). We assumed a
32-bit-wide DDR SDRAM with a burst length of 4, a load latency of
8 clock cycles, and a store latency of 9 clock cycles. The number
of iterations was set to 16384; that is, the data size was
256 KB. The clock frequency was set to 100 MHz.
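The stated data size follows from the other parameters. The short Python check below assumes that each iteration transfers exactly one burst of four 32-bit words, which is how the MAP template in Fig. 4 behaves; the variable names are illustrative. It also converts the memory-bound cycle count n x (Tl + Ts) into time at the stated clock frequency, as a rough lower bound from the model rather than a measured value.

```python
N_ITER = 16384            # number of iterations (from the experiment)
BURST_LEN = 4             # words per burst (from the experiment)
WORD_BYTES = 4            # 32-bit words
T_L, T_S = 8, 9           # load and store latency in clock cycles
CLOCK_HZ = 100_000_000    # 100 MHz

data_bytes = N_ITER * BURST_LEN * WORD_BYTES   # bytes read (and written) in total
data_kib = data_bytes // 1024

mem_cycles = N_ITER * (T_L + T_S)              # cycles spent on loads and stores alone
mem_time_ms = mem_cycles / CLOCK_HZ * 1e3

print(data_bytes, data_kib, mem_cycles, round(mem_time_ms, 3))
```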

Figure. 9 Performance evaluation

The experimental result is shown in Fig. 9, which gives the
breakdown of the execution time of the data processing part (DP).
The horizontal axis is the number of clock cycles of the delay
loop. The vertical axis is the number of clock cycles consumed
until the hardware finishes processing all data. The black bar is
the stall time of the DP due to memory latency. The white bar is
the number of clock cycles consumed by execution other than the
memory stall. Even when the delay loop takes zero clock cycles,
the DP needs some clock cycles in addition to the stall time,
because the overhead of pushing to and popping from the FIFOs is
included. This overhead is 6 clock cycles per iteration. Hiding
memory latency by software pipelining improves the performance
significantly compared with normal execution.

Figure. 11 Total execution time

Fig. 10 shows the speedup ratio. ETne and ETsp denote the
calculated normal execution time and the calculated
software-pipelined execution time, respectively. Tne and Tsp are
the measured results. By hiding memory latency, speedups of 1.21
to 1.79 are achieved. In addition, the result shows that our
estimation method follows the same trend as the measured results.

Fig. 11 shows the estimated and measured execution times. When
the data processing time (Tdp) is less than Tl + Ts = 17, the
data processing is overlapped with the memory access, and the
performance approaches the memory bottleneck plus the inherent
overhead of FIFO access. Once Tdp becomes greater than or equal
to Tl + Ts = 17, the data processing dominates the performance,
so the execution time increases as the delay loop (computation)
becomes longer.

B. Hardware Size 

Restructuring the hardware description as shown in Fig. 7 to
apply software pipelining may increase the hardware size compared
with the normal version. To quantify this overhead, we
implemented the Handel-C description on an FPGA. The target FPGA
is a Spartan-6, and ISE 13.1 is used for implementation. The
result shows that the software-pipelined version uses 1.09 times
the logic resources of the normal version. The difference between
the clock frequencies of the two versions is only about 2%. Thus,
the hardware overhead of applying software pipelining is very
small and is compensated by the performance improvement.

VIII. CONCLUSION

We have presented a generic method for describing hardware that
includes a memory access controller in a C-based high-level
synthesis technology, Handel-C. The method is simple and easy to
apply, so designers can use memory latency hiding at design entry
at the C level. The experimental results show that the proposed
method improves performance significantly by hiding memory access
latency. The new performance estimation method is useful because
the estimated performance follows the same trend as the measured
results. The proposed method has no significant adverse effect on
hardware size or clock frequency compared with the normal
version. As future work, we plan to apply our method to other
commercial HLS tools. We will also evaluate more application
programs and integrate hardware modules into an actual SoC.